@andyzzhao andyzzhao commented Oct 3, 2025

Problem

Currently, some subscriptions fail to send because the asset export takes too long and Temporal times out. We have configured the Temporal start_to_close_timeout to be 15 min, but the asset export can take longer if the underlying queries are slow.

For example, in attempt=3 for this workflow run: https://grafana.prod-us.posthog.dev/goto/95mPwU3HR?orgId=1

5 of the 6 exporting_asset logs (emitted when an export starts) have a corresponding asset_exported log. Asset 2401337 does not have an asset_exported log. 2401337's attempt started at 2025-10-03 15:25:36.305979, and the last asset_exported log, belonging to the 5th asset, was at 2025-10-03 15:38:00.759231. The timeout was at 2025-10-03 15:40:36.28 UTC (Temporal logs).

My assumption here is that exporting 2401337 took longer than 2 min and did not finish before 15:40, so the Temporal start_to_close_timeout killed the activity.
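To make the timeline concrete, the gaps can be worked out from the three timestamps quoted above. This is only arithmetic on those log values (variable names are illustrative, not from the codebase):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the logs quoted in this description.
export_started = datetime.strptime("2025-10-03 15:25:36.305979", FMT)
last_asset_exported = datetime.strptime("2025-10-03 15:38:00.759231", FMT)
activity_timed_out = datetime.strptime("2025-10-03 15:40:36.28", FMT)

# How long asset 2401337 had been exporting when the activity was killed.
elapsed = activity_timed_out - export_started
# How long the activity kept waiting after the 5th asset finished.
gap_after_last_export = activity_timed_out - last_asset_exported

print(elapsed.total_seconds())                # 899.974021 (about 15 min)
print(gap_after_last_export.total_seconds())  # 155.520769 (about 2 min 36 s)
```

So the export of 2401337 had been running for essentially the whole 15 min start_to_close_timeout, and the activity spent the last ~2.5 min waiting on that one asset.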

The query logs for attempt 3 are here: https://metabase.prod-us.posthog.dev/question/1611-subscription-35-queries-oct-3. Notice that the last query times out at 16:25.

#39055

Changes

  • Fail gracefully when an export takes too long and send the subscription with partial results. This is accomplished by setting an asyncio timeout shorter than the start_to_close_timeout.
  • Also add some logs to make debugging easier
  • Tag each ClickHouse query with the asset id so a query can be correlated with a specific asset export
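A minimal sketch of the first point, assuming the exports run as concurrent asyncio tasks. The function name `export_all_with_deadline`, its parameters, and the fake exporters are hypothetical and are not the code in this PR; the idea is only that an inner deadline shorter than start_to_close_timeout lets the activity return whatever finished instead of being killed:

```python
import asyncio
from typing import Awaitable, Callable


async def export_all_with_deadline(
    exports: dict[int, Callable[[], Awaitable[str]]],
    timeout_seconds: float,
) -> tuple[dict[int, str], list[int]]:
    """Run all exports concurrently; return (finished results, unfinished asset ids)."""
    tasks = {asset_id: asyncio.create_task(fn()) for asset_id, fn in exports.items()}

    # Wait at most timeout_seconds; this must be shorter than the Temporal
    # start_to_close_timeout so we get a chance to send partial results.
    done, pending = await asyncio.wait(tasks.values(), timeout=timeout_seconds)

    # Cancel whatever is still running and let the cancellations settle.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    finished: dict[int, str] = {}
    unfinished: list[int] = []
    for asset_id, task in tasks.items():
        if task in done and not task.cancelled() and task.exception() is None:
            finished[asset_id] = task.result()
        else:
            unfinished.append(asset_id)
    return finished, unfinished


async def _demo() -> tuple[dict[int, str], list[int]]:
    async def fast() -> str:  # stands in for an export with quick queries
        await asyncio.sleep(0.01)
        return "fast.png"

    async def slow() -> str:  # stands in for an export stuck on a slow query
        await asyncio.sleep(5)
        return "slow.png"

    return await export_all_with_deadline({1: fast, 2: slow}, timeout_seconds=0.2)


finished, unfinished = asyncio.run(_demo())
print(finished, unfinished)
```

In this sketch a failed export is treated the same as a timed-out one (it ends up in `unfinished`); whether that is the right behaviour for subscriptions is a separate decision.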

How did you test this code?

Unit tests

pytestmark = [pytest.mark.asyncio, pytest.mark.django_db(transaction=True)]


@pytest_asyncio.fixture(autouse=True)

@andyzzhao andyzzhao Oct 6, 2025

Mocking this because export_asset_direct makes the tests run for 2 min vs 14s with the mock.

============================= test session starts ==============================
PASSED [ 12%]
PASSED [ 25%]
ee/tasks/test/subscriptions/test_generate_assets_async.py::test_generate_assets_async_excludes_deleted_insights PASSED [ 37%]
ee/tasks/test/subscriptions/test_generate_assets_async.py::test_generate_assets_async_raises_if_missing_resource PASSED [ 50%]
PASSED [ 62%]
ee/tasks/test/subscriptions/test_generate_assets_async.py::test_generate_assets_async_handles_empty_dashboard PASSED [ 75%]
PASSED [ 87%]
PASSED [100%]
=============================== inline snapshot ================================
======================== 8 passed in 160.56s (0:02:40) =========================

We already have tests for export_asset, which calls export_asset_direct, in test_exporter.py.

@andyzzhao andyzzhao marked this pull request as ready for review October 6, 2025 20:58
@andyzzhao andyzzhao requested a review from a team October 6, 2025 20:58

@greptile-apps greptile-apps bot left a comment

Additional Comments (2)

  1. ee/tasks/test/subscriptions/test_generate_assets_async.py, line 193-194 (link)

    style: Consider using list() directly instead of lambda wrapper for better readability

  2. posthog/temporal/subscriptions/subscription_scheduling_workflow.py, line 197 (link)

    logic: Inconsistent timeout configuration. This workflow still uses the old calculation pattern while ScheduleAllSubscriptionsWorkflow was updated to use TEMPORAL_TASK_TIMEOUT_MINUTES. Should use the same timeout setting for consistency.

6 files reviewed, 3 comments

github-actions bot commented Oct 7, 2025

Size Change: 0 B

Total Size: 3.07 MB

ℹ️ View Unchanged
Filename Size
frontend/dist/toolbar.js 3.07 MB

@andyzzhao andyzzhao changed the title from "fix: fail gracefully when subscription exports take too long" to "fix: fail gracefully when subscription a query takes too long" Oct 7, 2025
PARALLEL_ASSET_GENERATION_MAX_TIMEOUT_MINUTES = get_from_env(
    "PARALLEL_ASSET_GENERATION_MAX_TIMEOUT_MINUTES", 10.0, type_cast=float
)
TEMPORAL_TASK_TIMEOUT_MINUTES = PARALLEL_ASSET_GENERATION_MAX_TIMEOUT_MINUTES * 1.5
Contributor

What is this supposed to cover?

It looks like this is used for deliver_subscription_report_activity which has to wait to generate all the assets?

In the worst case, what happens if one of those asset generations takes the full 10 minutes and fails and has to be retried?

@andyzzhao andyzzhao Oct 7, 2025

> It looks like this is used for deliver_subscription_report_activity which has to wait to generate all the assets?

Correct

If one asset generation takes the full time, we've configured Temporal to retry 2 more times. However, on each retry, we regenerate all the assets (potentially up to 6) again.
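As a rough upper bound, assuming the default of 10 minutes, that every attempt runs all the way to the activity timeout, and ignoring any retry backoff (`MAX_ATTEMPTS` and `MAX_ASSETS_PER_SUBSCRIPTION` are illustrative names, not settings in the codebase):

```python
# Default from the settings snippet above; the env var can override it.
PARALLEL_ASSET_GENERATION_MAX_TIMEOUT_MINUTES = 10.0
TEMPORAL_TASK_TIMEOUT_MINUTES = PARALLEL_ASSET_GENERATION_MAX_TIMEOUT_MINUTES * 1.5

MAX_ATTEMPTS = 1 + 2              # first attempt plus 2 retries
MAX_ASSETS_PER_SUBSCRIPTION = 6   # "potentially up to 6" assets per attempt

# Worst case: every attempt hits the activity timeout.
worst_case_minutes = TEMPORAL_TASK_TIMEOUT_MINUTES * MAX_ATTEMPTS
# Every attempt regenerates all assets from scratch.
worst_case_exports = MAX_ASSETS_PER_SUBSCRIPTION * MAX_ATTEMPTS

print(TEMPORAL_TASK_TIMEOUT_MINUTES)  # 15.0
print(worst_case_minutes)             # 45.0
print(worst_case_exports)             # 18
```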

Contributor

Cool yeah, I guess we don't want to do that eh? Maybe better to use temporal for the asset generation workflows too so we can run them more independently. Sgtm for now.

@andyzzhao andyzzhao merged commit 7e2e2c7 into master Oct 8, 2025
252 of 255 checks passed
@andyzzhao andyzzhao deleted the andyzzhao/fix-subscriptions branch October 8, 2025 13:45
snorkelopstesting1-a11y pushed a commit to snorkel-marlin-repos/PostHog_posthog_pr_39132_6a5bf391-b3c0-4d0c-8e1a-1ccc1202a737 that referenced this pull request Oct 11, 2025
snorkelopstesting1-a11y added a commit to snorkel-marlin-repos/PostHog_posthog_pr_39132_6a5bf391-b3c0-4d0c-8e1a-1ccc1202a737 that referenced this pull request Oct 11, 2025
… takes too long

Merged from original PR #39132
Original: PostHog/posthog#39132